In this lesson, we will introduce you to 3 public health and air quality datasets. These include the EPA’s Toxic Release Inventory (TRI), the EPA’s Integrated Compliance Information System for Air (ICIS-AIR), and the CDC’s PLACES health outcomes dataset. You will learn how to form queries and retrieve each dataset from their respective API, process the returned object into a useable format, create visualizations and perform simple analysis. You will also use NASA’s Social Vulnerability Index (SVI) to examine socioeconomic patterns in the greater Detroit metropolitan area and explore relationships with public health.
Programming Reminder
This lesson uses the Python programming environment.
Learning Objectives
After completing this lesson, you should be able to:
Retrieve multiple public health datasets through the EPA and CDC’s APIs.
Create multipanel raster maps from multilayered SVI data.
Geolocating tabular data using latitude and longitude.
Create maps joining geolocated points and basemaps.
Create maps with graduated point symbols for point source pollution releases.
Interpolate a raster surface from point spatial data.
Perform raster math to create specialized indexes.
Perform spatial correlation analysis.
Preface: A Note on Data Interpretation and Ethical Considerations
This lesson is designed to introduce you to various environmental, public health, and socioeconomic datasets, and to provide you with the tools to analyze and visualize this information. However, it’s crucial to approach this material with an understanding of its sensitive nature and potential implications.
The topics of environmental justice, institutional racism, socioeconomic conditions, and pollution are complex and multifaceted. The data and analyses presented in this lesson are not intended to draw definitive conclusions or suggest scientific evidence of cause and effect relationships. Rather, they are meant to equip you with the skills to investigate data and perform analyses that could be applied to your local communities.
As you work through this lesson, remember that correlation does not imply causation. The patterns you may observe in the data could be influenced by numerous factors not captured in these datasets. It’s essential to approach your findings with caution and to consider the broader historical, social, and economic contexts that shape environmental and health outcomes in different communities.
This lesson aims to empower you with data literacy and analytical skills. We encourage you to use these tools responsibly, always considering the ethical implications of your analyses and the potential impact on the communities represented in the data. When drawing insights or making decisions based on such analyses, it’s crucial to involve community stakeholders, consider multiple perspectives, and seek additional expertise when necessary.
Introduction
In many urban areas, air quality issues disproportionately affect low-income communities and communities of color. This disparity is a key focus of environmental justice efforts. In cities like Detroit, Chicago, and Cleveland industrial facilities, highways, and other pollution sources are often concentrated in disadvantaged neighborhoods. Detroit has a history of industrial pollution and is working to address air quality issues in areas like Southwest Detroit, where residents face higher exposure to pollutants from nearby industries and heavy truck traffic. Similarly, Chicago has seen community efforts to address air pollution in areas like the Southeast Side, where residents have fought against polluting industries and advocated for stricter regulations.
The EPA is the primary federal agency responsible for environmental protection in the United States. It sets and enforces air quality standards, including the National Ambient Air Quality Standards (NAAQS) for six criteria pollutants. The EPA also maintains the AirNow system, which provides real-time air quality data to the public. CDC (Centers for Disease Control and Prevention): While primarily focused on public health, the CDC plays a crucial role in understanding the health impacts of air pollution. It conducts research on the relationships between air quality and various health outcomes, and provides guidance on protecting public health from environmental hazards.
In response to these challenges, community-driven science initiatives have emerged. These efforts involve local residents in collecting data on air quality and other environmental factors, often using low-cost sensors and mobile monitoring techniques. This approach helps fill gaps in official monitoring networks and empowers communities to advocate for themselves. Open data is crucial for community-driven solutions in several ways:
Transparency: Open data allows communities to verify official reports and hold authorities accountable.
Accessibility: When air quality data is freely available, communities can use it to inform local decision-making and advocacy efforts.
Innovation: Open data enables researchers, activists, and tech developers to create new tools and analyses that can benefit communities.
Collaboration: Open data facilitates collaboration between communities, scientists, and policymakers, leading to more effective solutions.
While the EPA and CDC provide federal networks of open-access datasets like the Air Quality System (AQS), TRI, ICIS-AIR, and PLACES collections, community driven data collection is a large part of driving change in traditionally underrepresented communities. In Chicago, the Array of Things project has installed sensors throughout the city, providing open data on various environmental factors including air quality, while Detroit’s Community Air Monitoring Project uses low-cost sensors to collect and share air quality data in areas underserved by official monitoring stations.
With access to open data, communities can:
Identify local air quality hotspots that may be missed by sparse official monitoring networks.
Correlate air quality data with health outcomes to strengthen advocacy efforts.
Develop targeted interventions, such as promoting indoor air filtration on high-pollution days.
Create custom alerts and information systems tailored to local needs.
Although open data provides many benefits, challenges still remain. These include: 1) ensuring data quality and consistency, especially when integrating data from various sources; 2) addressing the “digital divide” to ensure all community members can access and use the data; 3) balancing the need for detailed local data with privacy concerns; and 4) building capacity within communities to effectively use and interpret complex environmental data.
EJ in the News
The Clear the Air Coalition, a new environmental justice advocacy group in Michigan, is calling for stronger protection of vulnerable communities from pollution. Launched in Detroit, the coalition argues that state regulators focus too much on technical compliance with environmental laws rather than public health. They claim the current permitting process favors polluters and fails to consider the cumulative impacts of pollution on overburdened communities. The group seeks changes in state law and a shift in mindset from environmental regulators.
Clear the Air Coalition
While the Michigan Department of Environment, Great Lakes, and Energy (EGLE) states its commitment to protecting underserved communities and notes improvements in air quality over recent decades, coalition members argue that the current approach is insufficient. They call for regulators to consider the combined effects of multiple pollution sources in marginalized areas and to give local governments more power to reject new polluting facilities in already burdened communities. The coalition plans to raise awareness and push for these changes during the upcoming Air Quality Awareness Week.
The state of Michigan, Department of Environment, Great Lakes, and Energy (EGLE), and the Office of the Environmental Justice Public Advocate are one of the most active state departments aiming to address issues of the environment, public health, and social justice. Here are some of the projects enacted in greater Detroit area:
Detroit Environmental Agenda: This community-led initiative works closely with EGLE to address environmental concerns in Detroit. It focuses on issues like air quality, water quality, and waste management.
48217 Community Air Monitoring Project: Named after the zip code of a heavily industrialized area in Southwest Detroit, this project involves community members working with EGLE to monitor air quality using low-cost sensors.
Detroit Climate Action Plan: Developed in partnership with EGLE, this plan addresses climate change impacts on the city, with a focus on vulnerable communities.
Delray Neighborhood Initiatives: EGLE has been involved in efforts to address air quality concerns in the Delray neighborhood, which is impacted by industrial emissions and heavy truck traffic.
Green Door Initiative: This Detroit-based organization collaborates with EGLE on various environmental justice projects, including lead abatement and air quality improvement efforts.
Detroit River Sediment Cleanup: EGLE has been involved in efforts to clean up contaminated sediments in the Detroit River, which disproportionately affects nearby low-income communities.
Asthma Prevention Programs: EGLE supports community-based asthma prevention programs in Detroit, recognizing the link between air quality and asthma rates in disadvantaged neighborhoods.
These are just a few of examples that demonstrate the ongoing collaboration between community groups, local government, and EGLE to address environmental justice concerns in Detroit.
Air quality issues in urban areas disproportionately affect low-income communities and communities of color, making it a critical focus for environmental justice efforts. Federal agencies like the EPA and CDC play crucial roles in setting standards and conducting research, but community-driven science initiatives have emerged as powerful tools for local action. Open data is key to these efforts, enabling transparency, accessibility, innovation, and collaboration. Cities like Detroit and Chicago are at the forefront of these initiatives, with projects that empower residents to monitor and advocate for better air quality. While challenges remain, particularly in data management and accessibility, the collaboration between community groups, local governments, and state agencies like Michigan’s EGLE demonstrates a promising path forward.
Data Analysis and Exercises
To better understand and address these complex issues, we’ll work through a series of coding and data science exercises. These hands-on activities will allow users to explore some of the open datasets and tools that are available to community members and stakeholders that offer practical insights into air quality, public health, and social vulnerability.
We’ll begin with the ICIS-AIR dataset, then move on to the TRI Facility dataset, and finally work our way down to TRI Form A. After exploring these 3 air pollution datasets, we will take a look at the CDC PLACES health outcomes data and perform some simple analysis integrating the pollution and health outcomes data.
Data Science Review
This lesson uses the following Python modules:
pandas: Essential for data manipulation and analysis.
geopandas: Extends pandas functionality to handle geospatial data.
matplotlib: Used for creating static, animated, and interactive visualizations.
numpy: Provides support for large, multi-dimensional arrays and matrices, along with mathematical functions to operate on these arrays.
requests: Allows you to send HTTP requests and interact with APIs easily.
contextily: Adds basemaps to your plots, enhancing the visual context of your geospatial data.
pygris: Simplifies the process of working with US Census Bureau TIGER/Line shapefiles.
rasterio: Reads and writes geospatial raster datasets.
xarray: Introduces labels in the form of dimensions, coordinates, and attributes on top of raw NumPy-like arrays, making it easier to work with labeled multi-dimensional arrays.
shapely: Manipulation and analysis of geometric objects in the Cartesian plane.
scipy: Used for scientific and technical computing, particularly the stats and interpolate submodules.
rioxarray: Extends xarray to make it easier to handle geospatial raster data.
time: Provides various time-related functions (part of Python’s standard library).
tabulate: Used for creating nicely formatted tables in various output formats.
splot: Provides statistical plots for spatial analysis.
If you’d like to learn more about the functions used in this lesson, you can refer to the documentation on their respective websites.
The pandas module is essential for data manipulation and analysis, while geopandas extends its functionality to handle geospatial data. matplotlib is used for creating static, animated, and interactive visualizations. numpy provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.
The requests module allows you to send HTTP requests and interact with APIs easily. contextily adds basemaps to your plots, enhancing the visual context of your geospatial data. pygris simplifies the process of working with US Census Bureau TIGER/Line shapefiles.
rasterio and xarray are used for working with geospatial raster data. rasterio reads and writes geospatial raster datasets, while xarray introduces labels in the form of dimensions, coordinates, and attributes on top of raw NumPy-like arrays, making it easier to work with labeled multi-dimensional arrays.
shapely is used for manipulation and analysis of geometric objects. scipy provides additional tools for scientific computing, including statistical functions and interpolation methods. rioxarray combines the functionality of rasterio and xarray for easier handling of geospatial raster data.
pysal is a library for spatial analysis, providing tools for exploratory spatial data analysis. splot is a visualization module that works with pysal to create statistical plots for spatial analysis.
Make sure these modules are installed before you begin working with the code in this document.
ICIS Air
ICIS-AIR (Integrated Compliance Information System for Air) is a comprehensive database maintained by the U.S. Environmental Protection Agency (EPA) that focuses specifically on air quality compliance and enforcement data. It tracks information related to stationary sources of air pollution, including their compliance with various Clean Air Act regulations, permit data, and enforcement actions. While ICIS-AIR and the Toxic Release Inventory (TRI) are separate systems, they have significant overlap in their coverage of industrial facilities and air emissions. Many facilities that report to TRI also have data in ICIS-AIR, providing complementary information.
Where TRI focuses on the quantities of specific toxic chemicals released into the air, ICIS-AIR offers a broader picture of a facility’s overall air quality compliance status, including non-toxic pollutants regulated under the Clean Air Act. Together, these systems provide a more comprehensive view of industrial air pollution sources, enabling researchers, regulators, and the public to assess both the types and quantities of air pollutants emitted (from TRI) and the regulatory compliance status of the emitting facilities (from ICIS-AIR).
Data Processing
Defining Our Study Area
Next we’ll query the API for ICIS-AIR regulated facilities in the greater Detroit area, but first we need to create a spatial object that defines our study area that we can use throughout this analysis. The U.S. Census Bureau defines the Detroit, MI Metropolitan Statistical Area (MSA) as including 6 counties (Wayne, Macomb, Oakland, Lapeer, Livingston, and St. Clair), but for this lesson we will focus on the core 3 counties (Wayne, Macomb, and Oakland); both population and air emissions related facilities are much more sparse in the outer counties.
We can us pygris to get vector boundaries for the counties and dissolve them into a single boundary we can you for cropping data and restricting API searches.
import geopandas as gpdimport pandas as pdimport pygrisfrom shapely.geometry import boxcounties = ['Wayne', 'Oakland', 'Macomb']# Fetch Detroit metro area counties using pygrismetro_counties = pygris.counties(state="MI", year=2022)detroit_metro = metro_counties[metro_counties['NAME'].isin(counties)]# Dissolve the counties into a single polygondetroit_metro = detroit_metro.dissolve(by='STATEFP')# Convert to GeoDataFramedetroit_metro = gpd.GeoDataFrame(detroit_metro, geometry='geometry', crs='EPSG:4269')# Get the bounding boxbbox = detroit_metro.total_bounds# Create the bounding box as a polygonbbox_polygon = gpd.GeoDataFrame( geometry=[box(*bbox)], crs=detroit_metro.crs)
Using FIPS code '26' for input 'MI'
Let’s verify this by overlaying it on a basemap.
import matplotlib.pyplot as pltimport contextily as ctx# Create the plotfig, ax = plt.subplots(figsize=(7, 7))# Reproject data to Web Mercator to match the basemapdetroit_metro_bm = detroit_metro.to_crs(epsg=3857)bbox_polygon_bm = bbox_polygon.to_crs(epsg=3857)# Plot the metro area and bounding boxdetroit_metro_bm.plot(ax=ax, color='lightblue', edgecolor='darkblue', alpha=0.3)bbox_polygon_bm.boundary.plot(ax=ax, color='grey', linewidth=2)# Add the basemapctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik)# Set the extent of the map to the bounding boxax.set_xlim(bbox_polygon_bm.total_bounds[0], bbox_polygon_bm.total_bounds[2])ax.set_ylim(bbox_polygon_bm.total_bounds[1], bbox_polygon_bm.total_bounds[3])# Remove axesax.set_axis_off()plt.title("Detroit Metro Study Area", fontsize=16)plt.tight_layout()plt.show()
This looks correct, and we’ll use this boundary object repeatedly throughout our analysis.
Querying the API
The EPA’s Enforcement and Compliance History Online (ECHO) website serves as a comprehensive portal for environmental compliance and enforcement data. It includes the ECHO Data Service, which provides programmatic access to EPA data through RESTful APIs. Among these is the ICIS-AIR API, which specifically offers access to air quality compliance and enforcement data. This API allows users to retrieve detailed information about stationary sources of air pollution, including facility details, permit data, emissions reports, and compliance status.
The ECHO API requires that users follow a multi-step workflow to acquire data via the Facility Air API:
Use get_facilities to validate parameters, get summary statistics, and obtain a query_id (QID), valid for about 30 minutes.
Use get_qid with the obtained QID to paginate through facility results.
Use get_map with the QID to visualize and navigate facility locations.
Use get_download with the QID to generate a CSV file of facility information.
The get_facility_info endpoint operates independently, returning either clustered summary statistics or an array of individual facilities. This API structure allows for efficient querying, visualization, and extraction of air quality compliance data, making it a valuable resource for environmental analysis and research.
We’ll begin by querying this database for just Michigan facilities in the workflow detailed above. From there we’ll subset with our list of counties.
We start with the base url for the ECHO API and our parameter list.
# Base URL for ECHO ICIS-AIR APIbase_url ="https://echodata.epa.gov/echo/air_rest_services"# Parameters for the initial API callparams = {"output": "JSON","p_st": "MI"}
Data Science Review
An Application Programming Interface (API) is a set of protocols, tools, and definitions that enable different software applications to communicate with each other. It acts as an intermediary, allowing one application to request specific data or actions from another application or service without needing access to its entire codebase. APIs define the methods and data formats that applications can use to request and exchange information, typically through specific endpoints (URLs) that accept requests and return responses in standardized formats like JSON. They are crucial for integrating external services, accessing databases, and enabling communication between different parts of larger open science software systems.
Now we define 2 functions that will carry out the first 2 steps; 1) identifying the facilities in our search and 2) retrieve the facility data.
import requests# Define the function to query the API for the facilities of interestdef get_facilities(): response = requests.get(f"{base_url}.get_facilities", params=params)if response.status_code ==200: data = response.json()if'Results'in data: qid = data['Results']['QueryID']print(f"Query ID: {qid}")print(f"Total Facilities: {data['Results']['QueryRows']}")return qidprint("Failed to get facilities and QID")returnNone# Define the function to retrieve the relevant data from the facilities of interestdef get_facility_data(qid): all_facilities = [] page =1whileTrue: params = {"qid": qid, "pageno": page, "output": "JSON"} response = requests.get(f"{base_url}.get_qid", params=params)if response.status_code ==200: data = response.json()if'Results'in data and'Facilities'in data['Results']: facilities = data['Results']['Facilities']ifnot facilities: # No more facilities to retrievebreak all_facilities.extend(facilities)print(f"Retrieved page {page}") page +=1else:breakelse:print(f"Failed to retrieve page {page}")breakreturn all_facilities
Now that we’ve defined the functions we can make a request to the API with our Detroit metro counties.
# Step 1: Get the Query IDqid = get_facilities()if qid:# Step 2: Use get_qid to retrieve all facility dataprint("Retrieving facility data...") facilities = get_facility_data(qid)# Convert to DataFrame df_icis_air = pd.DataFrame(facilities)print(f"\nSuccessfully retrieved {len(df_icis_air)} ICIS-AIR facilities for Michigan")print("\nColumns in the dataset:")print(df_icis_air.columns)else:print("Failed to retrieve facility data")
There are over three thousand facilities in the ICIS-AIR Michigan dataset. Here are the descriptions for some key columns.
Key Columns in the ICIS-AIR ECHO API Dataset
AIRName: The name of the facility
SourceID: A unique identifier for the air pollution source
AIRStreet, AIRCity, AIRState, AIRZip: Address components of the facility
RegistryID: A unique identifier for the facility in the EPA’s registry
AIRCounty: The county where the facility is located
AIREPARegion: The EPA region responsible for the facility
AIRNAICS: North American Industry Classification System code(s) for the facility
FacLat, FacLong: Latitude and longitude coordinates of the facility
AIRPrograms: Air quality programs applicable to the facility
AIRStatus: Current operational status of the facility
AIRUniverse: Categorization of the facility within air quality regulation
AIRComplStatus: Current compliance status under the Clean Air Act
AIRHpvStatus: High Priority Violator status
AIRQtrsWithViol: Number of quarters with violations
AIRLastViolDate: Date of the most recent violation
AIREvalCnt: Count of evaluations conducted
AIRLastEvalDate: Date of the most recent evaluation
AIRFceCnt: Count of Full Compliance Evaluations
AIRLastFceDate: Date of the last Full Compliance Evaluation
These columns provide information about each facility’s location, operational characteristics, compliance history, and regulatory oversight under the Clean Air Act. It includes information on violations, evaluations, and compliance status that assess a facility’s environmental performance and regulatory compliance.
We can check some basic information like how many facilities are in Detroit metro and the breakdown by our 3 counties.
# Subset the dataframe to include only the Detroit metro countiesicis_air_detroit = df_icis_air[df_icis_air['AIRCounty'].isin(counties)]# Print information about the subsetprint(f"Total ICIS-AIR facilities in Michigan: {len(df_icis_air)}")print(f"ICIS-AIR facilities in Detroit metro area: {len(icis_air_detroit)}")# Display the count of facilities in each metro countyprint("\nFacilities per county:")print(icis_air_detroit['AIRCounty'].value_counts())
Total ICIS-AIR facilities in Michigan: 3395
ICIS-AIR facilities in Detroit metro area: 903
Facilities per county:
AIRCounty
Wayne 431
Oakland 272
Macomb 200
Name: count, dtype: int64
Next we should check for facilities with missing latitude or longitude coordinates.
# Count records with missing coordinate valuesmissing_coords = icis_air_detroit[(icis_air_detroit['FacLat'].isnull()) | (icis_air_detroit['FacLong'].isnull())]print(f"Number of ICIS-AIR records with missing coordinates: {len(missing_coords)}")# Remove records with missing coordinatesicis_air_detroit = icis_air_detroit.dropna(subset=['FacLat', 'FacLong'])
Number of ICIS-AIR records with missing coordinates: 0
No missing records! That’s great. As you’ll find out later, you won’t always be so lucky.
Now we can create a spatial object, reproject it so it matches the basemap of Detroit, and make a map showing the location of all the ICIS-AIR facilities in Detroit metro.
# Create a GeoDataFrame for ICIS-AIR facilitiesgdf_icis_air = gpd.GeoDataFrame( icis_air_detroit, geometry=gpd.points_from_xy(icis_air_detroit.FacLong, icis_air_detroit.FacLat), crs="EPSG:4326")# Reproject ICIS-AIR data to Web Mercator so it matches the basemapgdf_icis_air_bm = gdf_icis_air.to_crs(epsg=3857)# Create the plotfig, ax = plt.subplots(figsize=(7, 7))# Plot the metro area and bounding box (reusing objects from earlier)detroit_metro_bm.plot(ax=ax, facecolor='none', edgecolor='blue', linewidth=2)bbox_polygon_bm.boundary.plot(ax=ax, color='#315c86', linewidth=3)# Plot ICIS-AIR facilitiesgdf_icis_air_bm.plot(ax=ax, color='cyan', markersize=50, alpha=0.5, label='ICIS-AIR Facilities', edgecolor ="grey")# Add the basemapctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik)# Set the extent of the map to the bounding boxax.set_xlim(bbox_polygon_bm.total_bounds[0], bbox_polygon_bm.total_bounds[2])ax.set_ylim(bbox_polygon_bm.total_bounds[1], bbox_polygon_bm.total_bounds[3])# Remove axesax.set_axis_off()# Add legendax.legend()plt.title("Detroit Metro Area ICIS-AIR Facilities", fontsize=16)plt.tight_layout()plt.show()print(f"Number of ICIS-AIR facilities plotted: {len(gdf_icis_air)}")
Number of ICIS-AIR facilities plotted: 903
These are distributed along business and industrial sectors as expected, but we can’t infer much else from this map.
Which regulated facilities have been in violation status? We can plot all the facilities with at least one violation during a reporting quarter and use graduated symbols that reflect the total number of quarters with violations. This information is in the AIRQtrsWithViol column.
# Convert to numeric, replacing non-numeric values with NaNgdf_icis_air_bm['AIRQtrsWithViol'] = pd.to_numeric(gdf_icis_air_bm['AIRQtrsWithViol'], errors='coerce')# subset the dataset for those with violationsgdf_icis_air_violators = gdf_icis_air_bm = gdf_icis_air_bm[gdf_icis_air_bm['AIRQtrsWithViol'] >0]# Create the plotfig, ax = plt.subplots(figsize=(7, 7))# Plot the metro area and bounding box (reusing objects from earlier)detroit_metro_bm.plot(ax=ax, facecolor='none', edgecolor='blue', linewidth=2)bbox_polygon_bm.boundary.plot(ax=ax, color='#315c86', linewidth=3)# Plot TRI facilities with graduated symbols based on air releasesscatter = ax.scatter(gdf_icis_air_violators.geometry.x, gdf_icis_air_violators.geometry.y, s=gdf_icis_air_violators['AIRQtrsWithViol']*10, # Adjust the multiplier as needed c='orangered', edgecolor='yellow', linewidth=1, alpha=0.7)# Add the basemapctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik)# Set the extent of the map to the bounding boxax.set_xlim(bbox_polygon_bm.total_bounds[0], bbox_polygon_bm.total_bounds[2])ax.set_ylim(bbox_polygon_bm.total_bounds[1], bbox_polygon_bm.total_bounds[3])# Remove axesax.set_axis_off()# Add legendax.legend()plt.title("Detroit Metro Area ICIS-AIR Facilities w/ Violations", fontsize=16)plt.tight_layout()plt.show()
We can also create a table showing which businesses have been in violation statues, their name, address, and the date of their most recent violation using the tabulate module.
from tabulate import tabulategdf_icis_air_violators = gdf_icis_air_bm = gdf_icis_air_bm[gdf_icis_air_bm['AIRQtrsWithViol'] >=10]# List the columns we want to usetable_columns = ['AIRName', 'AIRStreet', 'AIRCity', 'AIRQtrsWithViol', 'AIRLastViolDate']# Create a new DataFrame with only the columns we wanttable_data = gdf_icis_air_violators[table_columns].copy()# Ensure AIRLastViolDate is in datetime formattable_data['AIRLastViolDate'] = pd.to_datetime(table_data['AIRLastViolDate'])# Sort the DataFrame by AIRLastViolDate in descending ordertable_data = table_data.sort_values('AIRLastViolDate', ascending=False)# Create a dictionary to map old column names to pretty column namespretty_names = {'AIRName': 'Name','AIRStreet': 'Street','AIRCity': 'City','AIRQtrsWithViol': 'Violations','AIRLastViolDate': 'Violation Date'}# Rename the columnstable_data.rename(columns=pretty_names, inplace=True)# Format the 'Violation Date' column to a specific date format (optional)table_data['Violation Date'] = table_data['Violation Date'].dt.strftime('%Y-%m-%d')# Create the tabletable = tabulate(table_data, headers='keys', tablefmt='html', showindex=False)table
Name
Street
City
Violations
Violation Date
EES COKE BATTERY L.L.C.
1400 ZUG ISLAND ROAD
RIVER ROUGE
12
2024-04-23
ROSATI SPECIALTIES
24200 CAPITAL BLVD.
CLINTON TWP
12
2024-03-21
CARMEUSE LIME INC, RIVER ROUGE OPERATION
25 MARION AVE
RIVER ROUGE
12
2023-08-17
BASF CORPORATION - CHEMICAL PLANTS
1609 BIDDLE AVE
WYANDOTTE
12
2023-03-24
MARATHON PETROLEUM COMPANY LP
1001 S OAKWOOD
DETROIT
12
2023-03-15
FCA US LLC - DETROIT ASSEMBLY COMPLEX
2101 CONNER AVE
DETROIT
12
2023-01-25
U S STEEL GREAT LAKES WORKS
1 QUALITY DR
ECORSE
12
2022-11-07
FCA US LLC WARREN TRUCK ASSEMBLY PLANT
21500 MOUND ROAD
WARREN
12
2022-10-21
NYLOK LLC
15260 HALLMARK COURT
MACOMB
12
2022-08-29
AJAX METAL PROCESSING INC.
4651 BELLEVUE AVE
DETROIT
10
2022-06-07
STERLING ENGINE HOLDINGS LLC
54420 PONTIAC TRAIL
MILFORD
12
2021-06-10
DETROIT RENEWABLE POWER, LLC
5700 RUSSELL ST
DETROIT
12
2019-08-29
JVISFH, LLC
23944 FREEWAY PARK DRIVE
FARMINGTN HLS
12
2019-03-22
LEAR CORPORATION DBA EAGLE OTTAWA
2930 WEST AUBURN RD
ROCHESTER HLS
11
2014-11-10
BASF CORPORATION - PLASTIC PLANTS
1609 BIDDLE AVE.
WYANDOTTE
12
2014-06-24
WYANDOTTE DEPT MUNI POWER PLANT
2555 VAN ALSTYNE
WYANDOTTE
12
2014-04-21
POWER SOLUTIONS INTERNATIONAL
32505 INDUSTRIAL DRIVE
MADISON HTS
12
2013-12-09
HENRY FORD WEST BLOOMFIELD HOSPITAL
6777 WEST MAPLE ROAD
W BLOOMFIELD
12
2012-05-03
HD INDUSTRIES
19455 GLENDALE
DETROIT
12
2012-04-18
ICIS-AIR is a powerfule tool for communites and policy-makers to identify regulated facilities and their compliance status, but the dataset does have limitations. Next, we’ll take a look at facilities required to report to the Toxic Release Inventory (TRI).
Knowledge Check
Which of the following best describes the ICIS-AIR database?
A database of air quality measurements from monitoring stations
A comprehensive database of air quality compliance and enforcement data
A public health database tracking respiratory illnesses
A database of weather patterns affecting air quality
Toxic Release Inventory
The Toxic Release Inventory (TRI) is a crucial resource in understanding and addressing air quality issues in the United States. Established under the Emergency Planning and Community Right-to-Know Act of 1986, the TRI is a publicly accessible database maintained by the Environmental Protection Agency (EPA). It requires certain industrial facilities to report annually on their releases of toxic chemicals into the environment, including air emissions. The TRI provides valuable data on over 770 chemicals and chemical categories, offering insights into the types and quantities of pollutants released into the air by various industries. This information is vital to researchers, policymakers, and community advocates in assessing local air quality, identifying potential health risks, and developing targeted strategies to reduce toxic air emissions. By making this data publicly available, the TRI plays an important role in promoting transparency and supporting environmental justice initiatives focused on improving air quality in communities across the nation.
Data Review
The Toxics Release Inventory (TRI) and the Integrated Compliance Information System for Air (ICIS-AIR) are two important but distinct environmental reporting systems maintained by the U.S. Environmental Protection Agency (EPA). They have several key differences:
Regulatory Basis
TRI: Established under the Emergency Planning and Community Right-to-Know Act (EPCRA) of 1986
ICIS-AIR: Part of the Clean Air Act (CAA) compliance and enforcement program
Focus
TRI: Tracks the management of certain toxic chemicals that may pose a threat to human health and the environment
ICIS-AIR: Focuses specifically on air quality and emissions from facilities regulated under the Clean Air Act
Reported Information
TRI: Facilities report on releases, waste management, and pollution prevention activities for specific toxic chemicals
ICIS-AIR: Tracks emissions data, compliance status, and enforcement actions related to air quality regulations
Facility Coverage
TRI: Covers facilities in specific industries that manufacture, process, or use TRI-listed chemicals above certain thresholds
ICIS-AIR: Includes a broader range of facilities that emit air pollutants, regardless of the specific chemicals involved
Reporting Thresholds
TRI: Has specific chemical thresholds that trigger reporting requirements
ICIS-AIR: Generally doesn’t have chemical-specific thresholds; requirements are based on overall emissions and facility type
Public Accessibility
TRI: Designed with a strong focus on public right-to-know, with data easily accessible to the public
ICIS-AIR: While public, it’s primarily designed for regulatory and enforcement purposes
Data Frequency
TRI: Annual reporting is required for covered facilities
ICIS-AIR: May involve more frequent reporting, depending on permit requirements and compliance status
Scope of Pollutants
TRI: Focuses on a specific list of toxic chemicals and chemical categories
ICIS-AIR: Covers a wider range of air pollutants, including criteria air pollutants and hazardous air pollutants
Use in Environmental Management
TRI: Often used for assessing long-term trends in toxic chemical releases and waste management practices
ICIS-AIR: More commonly used for day-to-day air quality management and enforcement activities
Geographic Coverage
TRI: Nationwide program with consistent reporting across states
ICIS-AIR: While national, implementation can vary more by state or local air quality management district
Envirofacts API
EnviroFacts is a comprehensive online database and information system maintained by the U.S. Environmental Protection Agency (EPA). It serves as a centralized hub for accessing a wide range of environmental data collected by the EPA and other federal agencies. EnviroFacts integrates information from multiple EPA databases, covering various aspects of environmental health, including air quality, water quality, hazardous waste, and toxic releases. One of the key components of EnviroFacts is the Toxic Release Inventory (TRI) data. Through EnviroFacts, users can easily access and query TRI information, allowing them to investigate toxic chemical releases and waste management activities in their local areas. The integration of TRI data within the broader EnviroFacts system enables researchers, policymakers, and community members to contextualize toxic release information alongside other environmental indicators.
Envirofacts is available as a web based search and data platform and also as a programmatic API. The web platform is a great way to familiarize yourself with the available datasets and create simple downloads, however, for analytical purposes we recommend learning to navigate their API so you can create repeatable and reliable analysis.
The agreement, described as “groundbreaking” by the Sierra Club, introduces several new measures. EGLE will conduct environmental justice and cumulative impact analyses for future licensing decisions, potentially denying licenses that would have an unlawful impact on human health and the environment. The settlement also includes provisions for improved community engagement, such as better translation services, public input processes, and the installation of air monitors around U.S. Ecology North. Additionally, the state will work with local residents to conduct a community health assessment. While some details remain to be clarified, environmental advocates view this as a significant step towards advancing environmental justice in Michigan.
Querying the API
We’ll continue to use the Detroit metro boundary we created earlier. We assigned them the names detroit_metro and detroit_metro_bm (projected to mercator to match basemaps for plotting).
# Print the bounding box coordinatesprint("Bounding Box:")print(f"Minimum X (Longitude): {bbox[0]}")print(f"Minimum Y (Latitude): {bbox[1]}")print(f"Maximum X (Longitude): {bbox[2]}")print(f"Maximum Y (Latitude): {bbox[3]}")
Bounding Box:
Minimum X (Longitude): -83.689438
Minimum Y (Latitude): 42.02793
Maximum X (Longitude): -82.705966
Maximum Y (Latitude): 42.897541
Now we’ll create and empty list to store the TRI data, loop through each county making an API query, and place the retrieved data in the empty container.
# Fetch TRI facility data from EPA API for each countytri_data = []# Loop through each county to make a separate API inquiry (was having trouble attempting all at once)for county in counties: api_url =f"https://data.epa.gov/efservice/tri_facility/state_abbr/MI/county_name/{county}/JSON" response = requests.get(api_url)# Is our response goodif response.status_code ==200: county_data = response.json()# Add the county to the list tri_data.extend(county_data)else:print(f"Failed to fetch data for {county} County. Status code: {response.status_code}")
Processing the Data
The TRI data comes in json format. They can be a little confusing to interpret, becauce each record (traditionally a row in a table) is viewed with columns structured vertically. Let’s look at the first record.
This record shows all the information for PPG GROW DETROIT with columns and values listed side-by-side (‘city_name’: ‘DETROIT’). This may seem difficult to deal with, but most programming languages have tools to easily parse json data into traditional tables. pd.DataFrame does it automatically when you attempt to create a pandas data frame.
# Convert TRI data to a DataFrametri_df = pd.DataFrame(tri_data)print(f"Number of facilities fetched: {len(tri_df)}")# Check the first few rows (too large to print in this document)# tri_df.head
Number of facilities fetched: 789
Checking the first few rows–everything looks as it should. We want to geocode the facility locations into a point layer using the latitude and longitude values. Therefore, we should start by making sure all the facilities have latitude and longitude values.
We’ll check and remove those without.
# Create a copy of the dataframe to avoid SettingWithCopyWarningtri_df_clean = tri_df.copy()# Remove facilities with empty latitude or longitude valuestri_df_clean = tri_df_clean.dropna(subset=['pref_latitude', 'pref_longitude'])print(f"Number of facilities after removing empty coordinates: {len(tri_df_clean)}")# Convert latitude and longitude to numeric typetri_df_clean['pref_latitude'] = pd.to_numeric(tri_df_clean['pref_latitude'], errors='coerce')tri_df_clean['pref_longitude'] = pd.to_numeric(tri_df_clean['pref_longitude'], errors='coerce')
Number of facilities after removing empty coordinates: 478
Ouch! We lost roughly 300 records due to missing coordinates (789 vs. 478). If this were an important analysis for policy-making or scientific research we would take the time to geocode the listed addresses into lat/long coordinates, but for the purposes of this exercise we’ll move on.
In my personal investigations of this data, I noticed that some longitude coordinates were not negative, but had an absolute value (e.g. 83.25) that made sense for the Detroit metro area, therefore we will flip the sign on these records.
# Function to correct longitude by making it negative if positivedef correct_longitude(lon):if lon >0:return-lonreturn lon# Apply longitude correctiontri_df_clean['pref_longitude'] = tri_df_clean['pref_longitude'].apply(correct_longitude)
Additionally, there are some records with wild longitudinal values that are not simply from a missing negative sign. We’ll identify these outliers using the interquantile range. Anything outside the range will be tosses. One of the downsides of the TRI database is that many aspects are provided directly by the facility (and therefore subject to errors). Again, if this were a more important analysis we would carefully explore all of these missing records and possible geocode using the address.
Number of facilities after removing longitude outliers: 473
Thankfully we only lost 2 more records from outlier.
Visualizing the Data
The data has been cleaned up a bit so now lets create a spatial point object from the data with geopandas and use matplotlib to plot the values on top of a basemap of Detroit we’ll acquire using contextily and our original Wayne, Macomb, and Oakland counties boundary.
# Create a GeoDataFrame from the cleaned TRI datadetroit_tri = gpd.GeoDataFrame( tri_df_clean, geometry=gpd.points_from_xy(tri_df_clean.pref_longitude, tri_df_clean.pref_latitude), crs="EPSG:4326")# Reproject data to Web Mercator for contextily# detroit_metro = detroit_metro.to_crs(epsg=3857)# bbox_polygon = bbox_polygon.to_crs(epsg=3857)detroit_tri = detroit_tri.to_crs(epsg=3857)# Create the plotfig, ax = plt.subplots(figsize=(7, 7))# Plot the metro area and bounding box (reusing objects from earlier)detroit_metro_bm.plot(ax=ax, facecolor='none', edgecolor='blue', linewidth=2)bbox_polygon_bm.boundary.plot(ax=ax, color='#315c86', linewidth=3)# Plot TRI facilities (reusing the detroit_tri object from earlier)detroit_tri.plot(ax=ax, color='purple', edgecolor='grey', markersize=50, alpha=0.5, label='TRI Facilities')# Plot ICIS-AIR facilitiesgdf_icis_air_bm.plot(ax=ax, color='cyan', edgecolor='grey', markersize=50, alpha=0.5, label='ICIS-AIR Facilities')# Add the basemapctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik)# Set the extent of the map to the bounding boxax.set_xlim(bbox_polygon_bm.total_bounds[0], bbox_polygon_bm.total_bounds[2])ax.set_ylim(bbox_polygon_bm.total_bounds[1], bbox_polygon_bm.total_bounds[3])# Remove axesax.set_axis_off()# Add legendax.legend()plt.title("Detroit Metro Area TRI and ICIS-AIR Facilities", fontsize=16)plt.tight_layout()plt.show()print(f"Number of TRI facilities plotted: {len(detroit_tri)}")print(f"Number of ICIS-AIR facilities plotted: {len(gdf_icis_air)}")
Number of TRI facilities plotted: 473
Number of ICIS-AIR facilities plotted: 903
We can see that there are far fewer facilities being tracked in TRI compared to ICIS-AIR. While some of this can be attributed to the records we tossed after checking for valid geographic coordinates and outliers, but generally speaking ICIS-AIR will contain far more as it tracks every facility regulated by the Clean Air Act as opposed to only those that are part of TRI.
You may have also noticed that this dataset does not contain any actually pollution data; only facility information for those that are regulated by TRI.
TRI is made up of numerous tables containing a wealth of information regarding the facilities themselves and all the regulated chemicals that are handled at each site. The TRI_FACILITY table we requested is the starter table used in most of the examples in the Envirofacts API documentation, however, it does not contain any chemical release data.
Navigating the TRI Envirofacts interface can be overwhelming at times. Multiple tables can be queried with a single URL, however, for a proper join to be carried out both tables must contain at least 1 column. In most cases this can be accomplished using TRI_FACILITY_ID. That said, air pollution release data (column name AIR_TOTAL_RELEASE) is only found in the TRI_FORM_R table, which does not contain the FACILITY_ID column.
You can querry the V_TRI_FORM_R_EZ using the API even though it’s not a documented table listed in the API website. This form contains facility information and AIR_TOTAL_RELEASE reporting, but it only goes until 2021. To get the most recent data (2023) you would have to manually download the individual tables with separate calls perform the joins, or use the custom form search web portal. We’ll demonstrate a quick API call for the V_TRI_FORM_R_EZ, but move forward with the most recent data available in the custom form search web portal.
Knowledge Check
What is the primary purpose of the Toxic Release Inventory (TRI)?
To track emissions of greenhouse gases
To monitor compliance with the Clean Air Act
To provide data on toxic chemical releases from industrial facilities
To record air quality index values for major cities
Toxic Release Inventory: Form R
We can make the query to Envirofacts. Let’s breakdown the call.
# URL of the CSV fileurl ="https://data.epa.gov/efservice/V_TRI_FORM_R_EZ/REPORTING_YEAR/2021/AIR_TOTAL_RELEASE/>/0/STATE_ABBR/MI/COUNTY_NAME/CONTAINING/WAYNE/CONTAINING/OAKLAND/CONTAINING/MACOMB/CSV"
Base url: https://data.epa.gov/efservice/
The table we’re requesting: V_TRI_FORM_R_EZ/
For just 2021: REPORTING_YEAR/2021/
Only facilities that report some air release: AIR_TOTAL_RELEASE/>/0/
# Read the CSV file directly with pandastri_form_r = pd.read_csv(url)print(f"Successfully read CSV. Number of records: {len(tri_form_r)}")# Display information about the datasetprint("\nColumns in the dataset:")print(tri_form_r.columns)
Successfully read CSV. Number of records: 413
Columns in the dataset:
Index(['tri_facility_id', 'facility_name', 'street_address', 'city_name',
'county_name', 'state_county_fips_code', 'state_abbr', 'zip_code',
'region', 'fac_closed_ind',
...
'additional_text_9_1', 'srs_id', 'media_type', 'prod_ratio_or_activity',
'industry_code', 'industry_description', 'source', 'method', 'adjusted',
'covered_naics'],
dtype='object', length=568)
We have roughly the same number of facilities as before, but this table has a lot of columns (568). We should make it more manageable.
This works well but we can get more recent data that is already filtered using the web portal. The user selects location, years of interest, and columns of interests. The portal serves up your requested dataset in an Amazon S3 cloud storage. We can important the dataset directly from the address they provide you.
tri_form_r = pd.read_csv("https://dmap-epa-enviro-prod-export.s3.amazonaws.com/396975438.CSV")# Display information about the datasetprint(f"Successfully read CSV. Number of records: {len(tri_form_r)}")print("\nColumns in the dataset:")print(tri_form_r.columns)
Successfully read CSV. Number of records: 994
Columns in the dataset:
Index(['AIR_TOTAL_RELEASE', 'CHEM_NAME', 'FACILITY_NAME', 'LATITUDE',
'LONGITUDE', 'STREET_ADDRESS', 'TRI_FACILITY_ID'],
dtype='object')
This is a bit more manageable, but we have multiple records per facility because each chemical release has a separate record. This allows the user to investigate specific chemicals, but for the purposes of this exercise we will aggregate AIR_TOTAL_RELEASE for each facility.
# Check how it looks (too large to print on the web)# tri_form_r.head
Let’s aggregate it.
# Sum across individual facilitiestri_form_r = tri_form_r.groupby(['FACILITY_NAME', 'LATITUDE','LONGITUDE','STREET_ADDRESS'], as_index=False).agg({'AIR_TOTAL_RELEASE': 'sum'})# How many records are there nowprint(f"Successfully read CSV. Number of records: {len(tri_form_r)}")
Successfully read CSV. Number of records: 211
There were only 211 unique listings in the 2023 data.
As before, we should check for valid coordinates and release data.
# Assuming the latitude and longitude columns are named 'LATITUDE' and 'LONGITUDE'# Adjust these names if they're different in your CSVlat_col ='LATITUDE'lon_col ='LONGITUDE'release_col ='AIR_TOTAL_RELEASE'# Adjust this to the actual column name for air releases# Remove records with missing coordinates or air release datadf_tri_clean = tri_form_r.dropna(subset=[lat_col, lon_col, release_col])print(f"\nNumber of records after removing missing data: {len(df_tri_clean)}")
Number of records after removing missing data: 211
There were no missing coordinates are release data.
Now we can create a spatial object with GeoPandas using the latitude and longitude coordinates in the table.
# Create a GeoDataFramegdf_tri_form_r = gpd.GeoDataFrame( df_tri_clean, geometry=gpd.points_from_xy(df_tri_clean[lon_col], df_tri_clean[lat_col]), crs="EPSG:4326")
Let’s check the distribution of the air release data. I suspect there will be outliers.
# Create the histogramplt.figure(figsize=(8, 8))plt.hist(gdf_tri_form_r['AIR_TOTAL_RELEASE'], bins=10, edgecolor='black')plt.title('Histogram of Air Total Release Sum')plt.xlabel('Air Sum')plt.ylabel('Frequency')plt.grid(True, alpha=0.3)# Show the plotplt.show()
Looks like there are outliers around 100,000 lbs and 400,000 lbs. Before we map the data, it would be a good idea to perform a log transformation of AIR_TOTAL_RELEASE to smooth out the visuals. You can’t take the log of negative or zero values, so we’ll use the log1p function, which adds 1 to every value before taking the log.
import numpy as np# Assuming 'result' is your DataFrame from earliergdf_tri_form_r['LOG_AIR_RELEASE'] = np.log1p(gdf_tri_form_r['AIR_TOTAL_RELEASE'])# Create the histogramplt.figure(figsize=(8, 8))plt.hist(gdf_tri_form_r['LOG_AIR_RELEASE'], bins=10, edgecolor='black')plt.title('Histogram of the Log of Air Total Release Sum')plt.xlabel('Log of Air Sum')plt.ylabel('Frequency')plt.grid(True, alpha=0.3)# Show the plotplt.show()
That looks much better.
Now we can visualize the data, by overlaying the points on a basemap of Detroit while using graduated symbols to illustrate the amount of air releases for a given facility.
# Reproject to Web Mercator to match the basemapgdf_tri_form_r_bm = gdf_tri_form_r.to_crs(epsg=3857)# Create the plotfig, ax = plt.subplots(figsize=(7, 7))# Plot the metro area and bounding box (reusing objects from earlier)detroit_metro_bm.plot(ax=ax, facecolor='none', edgecolor='blue', linewidth=2)bbox_polygon_bm.boundary.plot(ax=ax, color='#315c86', linewidth=2)# Plot TRI facilities with graduated symbols based on air releasesscatter = ax.scatter(gdf_tri_form_r_bm.geometry.x, gdf_tri_form_r_bm.geometry.y, s=gdf_tri_form_r_bm['LOG_AIR_RELEASE']*20, # Adjust the scaling factor as needed c='orangered', # Static fill color edgecolor='yellow', # Outline color linewidth=1, # Adjust the outline width as needed alpha=0.7)# Add the basemapctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik)# Set the extent of the map to the bounding boxax.set_xlim(bbox_polygon_bm.total_bounds[0], bbox_polygon_bm.total_bounds[2])ax.set_ylim(bbox_polygon_bm.total_bounds[1], bbox_polygon_bm.total_bounds[3])# Remove axesax.set_axis_off()# Add a legend for symbol sizeslegend_sizes = [0,4,8,12] # Example sizes, adjust based on your datalegend_elements = [plt.scatter([], [], s=size*20, c='orangered', edgecolor='yellow', linewidth=1, alpha=1, label=f'{size:,}') for size in legend_sizes]ax.legend(handles=legend_elements, title='Log Total Air Releases (lbs)', loc='lower right', title_fontsize=12, fontsize=10)plt.title("Detroit Metro Area TRI Facilities - Total Air Releases (Custom Data)", fontsize=16)plt.tight_layout()plt.show()print(f"\nNumber of TRI facilities plotted: {len(gdf_tri_form_r)}")print(f"Total air releases: {gdf_tri_form_r[release_col].sum():,.2f} lbs")print(f"Average air release per facility: {gdf_tri_form_r[release_col].mean():,.2f} lbs")
Number of TRI facilities plotted: 211
Total air releases: 2,117,167.80 lbs
Average air release per facility: 10,033.97 lbs
The point source air release data provides a good snapshot of the locations and relative amounts of chemical releases. With some basic understanding of the demographic dynamics of Detroit one might theorize who is being subjected to high levels of pollution, however, we’ll need additional data to develop any type of systematic analysis.
Knowledge Check
What key information does Form R of the Toxic Release Inventory provide?
Facility names and locations only
Chemical storage capacities of facilities
Detailed data on chemical releases, including air releases
Employee health records related to chemical exposure
Social Vulnerability Index
NASA’s Social Vulnerability Index (SVI) raster dataset, available through the Socioeconomic Data and Applications Center (SEDAC), is a high-resolution geospatial resource that quantifies social vulnerability across the United States. This dataset is based on the CDC’s Social Vulnerability Index but is presented in a raster format, providing continuous coverage at a 250-meter resolution. The SVI incorporates various socioeconomic and demographic factors such as poverty, lack of vehicle access, crowded housing, and minority status to assess communities’ capacity to prepare for, respond to, and recover from hazards, including environmental threats like poor air quality.
The raster format allows for more detailed spatial analysis and integration with other environmental datasets. This makes it particularly valuable for researchers and policymakers studying the intersection of social vulnerability and environmental risks, such as air pollution exposure. By overlaying this SVI data with air quality information, for instance, analysts can identify areas where socially vulnerable populations may be disproportionately affected by poor air quality, supporting environmental justice initiatives and targeted intervention strategies.
SEDAC, as part of NASA’s Earth Observing System Data and Information System (EOSDIS), hosts this dataset along with other socioeconomic and environmental data, facilitating interdisciplinary research on human-environment interactions. The SVI raster dataset’s high resolution and comprehensive coverage make it a powerful tool for assessing environmental equity and informing policy decisions at various geographic scales.
Although the SVI dataset is free to acquire, it does require an Earth Data account. If you don’t already have an Earth Data account you can follow these steps to download the SVI dataset on your local computer:
Visit the NASA Earthdata website: Go to https://urs.earthdata.nasa.gov/ Click on “Register”: Look for the “Register” button on the top right corner of the page.
Fill out the registration form: Provide the required information, including your name, email address, and a password. You’ll also need to create a username.
Verify your email: NASA will send a verification email to the address you provided. Click on the link in this email to confirm your account.
Log in to Earth Data: Once your account is verified, you can log in using your username and password.
Access SEDAC: Visit the SEDAC website (https://sedac.ciesin.columbia.edu/) and use your Earth Data credentials to log in when prompted.
Download the data: Once you’ve found the SVI dataset, you can use your Earth Data account to download it.
Remember, your Earth Data account gives you access not just to SEDAC, but to a wide range of NASA Earth science data. It’s a valuable resource for researchers, students, and anyone interested in environmental and socioeconomic data.
Data Processing
Once you’ve downloaded the dataset to your working directory, you can proceed with the analysis. We’ll be using the 2020 census tract version of SVI.
The different layers of SVI are provided as individual files, but sometimes it’s easier to work with a multilayer object. We can create one using xarray. To begin, we’ll read in each file individually, clip it to our border of Detroit metro, and create an individual data array.
import xarray as xrimport rasterioimport rasterio.mask# Specify the TIF filestif_files = ["data/svi/svi_2020_tract_overall_wgs84.tif","data/svi/svi_2020_tract_minority_wgs84.tif","data/svi/svi_2020_tract_socioeconomic_wgs84.tif","data/svi/svi_2020_tract_housing_wgs84.tif","data/svi/svi_2020_tract_household_wgs84.tif"]# Create an empty list to store the individual DataArraysdata_arrays = []# Read each TIF file, clip it to Detroit metro's extent, and append it to the listforfilein tif_files:with rasterio.open(file) as src:# Reproject Detroit metro boundary to match the raster CRS metro_reprojected = detroit_metro.to_crs(src.crs)# Clip the raster to Detroit metro's geometry out_image, out_transform = rasterio.mask.mask(src, metro_reprojected.geometry, crop=True) out_meta = src.meta.copy()# Update the metadata out_meta.update({"driver": "GTiff","height": out_image.shape[1],"width": out_image.shape[2],"transform": out_transform})# Create coordinates height = out_meta['height'] width = out_meta['width'] cols, rows = np.meshgrid(np.arange(width), np.arange(height)) xs, ys = rasterio.transform.xy(out_transform, rows, cols)# Convert lists to numpy arrays xs = np.array(xs) ys = np.array(ys)# Reshape coordinates to match dimensions of the raster xs = xs.reshape(height, width) ys = ys.reshape(height, width)# Create a DataArray from the clipped data da = xr.DataArray(out_image[0], # Use the first band coords={'y': ('y', ys[:, 0]),'x': ('x', xs[0, :])}, dims=['y', 'x']) da.attrs['crs'] =str(src.crs) # Convert CRS to string da.attrs['transform'] = out_transform data_arrays.append(da)
Now we can combine them together and give the layers “pretty” names.
# Combine all DataArrays into a single DataSetsvi_detroit = xr.concat(data_arrays, dim='layer')# Rename the layerslayer_names = ['Overall', 'Minority', 'Socioeconomic', 'Housing', 'Household']svi_detroit = svi_detroit.assign_coords(layer=('layer', layer_names))svi_detroit
# Define the colorbar limits (SVI is a 0-1 scale)vmin, vmax =0, 1# Create a multipanel plotfig, axes = plt.subplots(3, 2, figsize=(8, 10))axes = axes.flatten()# Plot each layerfor i, layer inenumerate(svi_detroit.layer.values):# Plot with custom color limits im = svi_detroit.sel(layer=layer).plot(ax=axes[i], add_colorbar=False, vmin=vmin, vmax=vmax, cmap='plasma') axes[i].set_title(layer)# Remove axes and labels axes[i].axis('off')# Plot Detroit metro boundary metro_reprojected.boundary.plot(ax=axes[i], color='red', linewidth=1)# Remove the extra subplotfig.delaxes(axes[5])# Add a single colorbarcbar_ax = fig.add_axes([0.92, 0.15, 0.02, 0.7])cbar = fig.colorbar(im, cax=cbar_ax, label='SVI Score', fraction=0.047, pad=0.04, aspect=20)plt.tight_layout()plt.show()
This provides an excellent look at demographic trends in the Detroit metro area. Overall vulnerability is highest in the inner city. The most striking drivers are the concentration of minorities and socioeconomic vulnerability in the downtown area. The housing and household components are slightly more varied throughout the region.
Knowledge Check
What does the Social Vulnerability Index (SVI) primarily measure?
Air pollution levels in different communities
Economic growth rates of different regions
Communities’ capacity to prepare for and respond to hazards
Population density in urban areas
Integrating TRI and SVI
There are numerous ways to assess the reationships between the SVI and TRI data. For this lesson we’ll identify areas where air releases and vulnerability are highest together by creating a new rasterized index that combines both layers.
To begin we need to rasterize the air release point data. We will create an empty raster grid and “fill” the cells with the sum of the air release values from any points the are overlapped by the cells.
Start by getting the bounds from our boundary object, specify the desired resolution, and set up the transform.
from rasterio.transform import from_origin# Get the bounds of the Detroit metro areaminx, miny, maxx, maxy = detroit_metro_bm.total_bounds# Define the resolution (1km/1000m) to match SVIresolution =5000# Calculate the number of cellsnx =int((maxx - minx) / resolution)ny =int((maxy - miny) / resolution)# Create the transform for the rastertransform = from_origin(minx, maxy, resolution, resolution)
Now we have to prepare the point values and rasterize them. A key thing to note here is the merge_alg=rasterio.enums.MergeAlg.add argument passed to the rasterize function. This makes sure that cells with more than 1 point will add the values together (the default is to simply replace).
from rasterio import features# Prepare geometries and values for rasterizationshapes = ((geom, value) for geom, value inzip(gdf_tri_form_r_bm.geometry, gdf_tri_form_r_bm.AIR_TOTAL_RELEASE))# Rasterize the point dataair_release_raster = features.rasterize(shapes=shapes, out_shape=(ny, nx), transform=transform, fill=0, all_touched=True, merge_alg=rasterio.enums.MergeAlg.add)
We rasterized with rasterio, but we need to convert this to an xarray object to continue and we need rioxarray to write the crs information for the xarray object.
import rioxarray# Convert the raster to an xarray DataArray# Note: We use ny and nx here to ensure the coordinates match the raster shapeair_release_raster_da = xr.DataArray(air_release_raster, coords={'y': np.linspace(maxy, miny, ny),'x': np.linspace(minx, maxx, nx)}, dims=['y', 'x'])air_release_raster_da.rio.write_crs(detroit_metro_bm.crs, inplace=True)
The spaces between county borders can create odd values because there are no points right next to each other. We can clip the raster to our Detroit boundary to fix this.
# Clip the raster with the Detroit metro boundaryair_release_raster_da = air_release_raster_da.rio.clip( detroit_metro_bm.geometry.values, detroit_metro_bm.crs, drop=False, all_touched=True)
Now we can see how our rasterized air release values look.
from matplotlib.colors import BoundaryNorm, ListedColormap# Define the breaks for the discrete scalebreaks = [0, 1, 10, 100, 1000, 10000, 100000, 250000, 500000]# Create a custom colormapcolors = ['#fffff3', '#FFFFCC', '#FFEDA0', '#FED976', '#FEB24C', '#FD8D3C', '#FC4E2A', '#E31A1C', '#B10026']cmap = ListedColormap(colors)# Create a normalization based on the breaksnorm = BoundaryNorm(breaks, cmap.N)# Create the plotfig, ax = plt.subplots(figsize=(7, 7))# Plot the TRI facility pointsgdf_tri_form_r_bm.plot(ax=ax, color='blue', markersize=10, alpha=0.7)# Plot the clipped raster with the custom colormap and normim = ax.imshow(air_release_raster_da, extent=[minx, maxx, miny, maxy], origin='upper', cmap=cmap, norm=norm)# Add colorbar with discrete labelscbar = plt.colorbar(im, ax=ax, extend='max', label='Total Air Releases (pounds)', ticks=breaks, fraction=0.047, pad=0.04, aspect=20)cbar.ax.set_yticklabels([f'{b:,}'for b in breaks])# Plot the Detroit metro boundarydetroit_metro_bm.boundary.plot(ax=ax, color='black', linewidth=2)# Set the extent to match the Detroit metro areaax.set_xlim(minx, maxx)ax.set_ylim(miny, maxy)# Add title and labelsax.set_title('TRI Air Total Release (100m resolution sum) with Facility Locations', fontsize=16)ax.set_xlabel('X Coordinate')ax.set_ylabel('Y Coordinate')plt.tight_layout()plt.show()print(f"Number of TRI facilities plotted: {len(gdf_tri_form_r_bm)}")print(f"Total air releases: {gdf_tri_form_r_bm['AIR_TOTAL_RELEASE'].sum():,.2f}")print(f"Maximum cell value in raster: {air_release_raster_da.max().values:,.2f}")
Number of TRI facilities plotted: 211
Total air releases: 2,117,167.80
Maximum cell value in raster: 434,344.84
The raster looks as expected with the highest values in locations where our previous map had the largest circles. We overlayed the points to see the sources of the values. The cell resolution is set to 5000m. This was selected so we can evaluate the map and the process we used to create the raster, but any value could be set depending on the intended use.
Air Release Vulnerability Index
Now that we’ve created our air release raster layer we can combine it with the SVI raster to create a new index identifying areas with the highest combination of air releases and vulnerability.
Start by selecting just the SVI Overall layer and converting it to a rioxarray object so we can perform raster calculations. We’ll also ensure the 2 raster layers “line up” by reprojecting into the same coordinate reference system and matching their resolutions.
from rasterio.enums import Resampling# Select the 'Overall' layersvi_overall = svi_detroit.sel(layer='Overall')# Convert to rioxarray for geospatial operationssvi_overall = svi_overall.rio.write_crs("EPSG:4326")# Reproject SVI to match the CRS of the air release rastersvi_reprojected = svi_overall.rio.reproject_match(air_release_raster_da)# Disaggregate the air release data to match the resolution of the SVI dataair_release_disaggregated = svi_reprojected.rio.reproject_match( svi_reprojected, resampling=Resampling.bilinear)
There are a few more steps that will aid in interpretation.
We will take the log of the air release data to reduce the impact of major outliers.
We will scale the logged air release data to 0-1 to match the SVI data. Now the resulting index we create will have equal contributions from both datasets.
# Log1p transform the air release data and scale to 0-1air_release_log = np.log1p(air_release_disaggregated)air_release_scaled = (air_release_log - air_release_log.min()) / (air_release_log.max() - air_release_log.min())# Multiply scaled air release data with SVI datavulnerability_indicator = air_release_scaled * svi_reprojected
Now let’s visualize our index along with the inputs.
# Create the plotsfig, axs = plt.subplots(3, 1, figsize=(8, 20))# Plot SVI Overallim1 = svi_reprojected.plot(ax=axs[0], cmap='viridis', vmin=0, vmax=1, add_colorbar=False)plt.colorbar(im1, ax=axs[0], label='SVI Overall', fraction=0.047, pad=0.04, aspect=20)axs[0].set_title('Social Vulnerability Index (Overall)', fontsize=16)detroit_metro.boundary.plot(ax=axs[0], color='black', linewidth=2)# Plot Original Air Release (log-transformed for better visualization)im2 = np.log1p(air_release_disaggregated).plot(ax=axs[1], cmap='YlOrRd', add_colorbar=False)plt.colorbar(im2, ax=axs[1], label='Log(Air Release + 1)', fraction=0.047, pad=0.04, aspect=20)axs[1].set_title('Air Release (Log-transformed)', fontsize=16)detroit_metro.boundary.plot(ax=axs[1], color='black', linewidth=2)# Plot Air Release Vulnerability Indicatorim3 = vulnerability_indicator.plot(ax=axs[2], cmap='YlOrRd', vmin=0, vmax=1, add_colorbar=False)plt.colorbar(im3, ax=axs[2], label='Air Release Vulnerability Indicator', fraction=0.047, pad=0.04, aspect=20)axs[2].set_title('Air Release Vulnerability Indicator\n(Scaled Air Release * SVI)', fontsize=16)detroit_metro.boundary.plot(ax=axs[2], color='black', linewidth=2)for ax in axs: ax.set_xlabel('Longitude') ax.set_ylabel('Latitude') ax.set_xlim(svi_reprojected.x.min(), svi_reprojected.x.max()) ax.set_ylim(svi_reprojected.y.min(), svi_reprojected.y.max())plt.tight_layout()plt.show()# Print some statisticsprint(f"Maximum vulnerability indicator: {vulnerability_indicator.max().values:.4f}")
Maximum vulnerability indicator: 0.9773
As one would expect, there are several areas in the downtown river corridor with very high combinations of SVI and TRI releases. The maximum score for our index is 0.9286! That is an extremely vulnerable community facing extraordinary levels of air pollution.
We can get a better sense of where these communities are by extracting the location of the cells with the highest scores and placing them on a basemap of Detroit.
We’ll convert the raster into a data frame, sort the values in descending order, and then extract the coordinates.
# Convert the vulnerability indicator to a pandas DataFramevulnerability_df = vulnerability_indicator.to_dataframe(name='index').reset_index()# Sort by index value and get the top 10top_10 = vulnerability_df.sort_values('index', ascending=False).head(10)# Create points from the coordinatestop_10['geometry'] = gpd.points_from_xy(top_10.x, top_10.y)top_10_gdf = gpd.GeoDataFrame(top_10, geometry='geometry', crs=vulnerability_indicator.rio.crs)
Now let’s make a map.
# Create the final mapfig, ax = plt.subplots(figsize=(8, 8))# Plot the Detroit metro boundarydetroit_metro_bm.boundary.plot(ax=ax, color='black', linewidth=2)# Plot the top 10 pointstop_10_gdf.plot(ax=ax, color='blue', markersize=100, alpha=0.7)# Add labels to the pointsfor idx, row in top_10_gdf.iterrows(): ax.annotate(f"#{idx+1}", (row.geometry.x, row.geometry.y), xytext=(3, 3), textcoords="offset points", color='black', fontweight='bold')# Add a basemapctx.add_basemap(ax, crs=vulnerability_indicator.rio.crs.to_string(), source=ctx.providers.OpenStreetMap.Mapnik)# Add a border to the mapbbox_polygon_bm.boundary.plot(ax=ax, color='#315c86', linewidth=3)# Set the extent to match the Detroit metro areaax.set_xlim(vulnerability_indicator.x.min(), vulnerability_indicator.x.max())ax.set_ylim(vulnerability_indicator.y.min(), vulnerability_indicator.y.max())ax.set_title('Top 10 Areas with Highest Air Release Vulnerability Index', fontsize=16)ax.set_axis_off()plt.tight_layout()plt.show()# Print the row index and index value of the top 10 pointsprint("Row index and index value of the top 10 points:")for i, (idx, row) inenumerate(top_10_gdf.iterrows(), 1):print(f"Row index = {idx}, Index value = {round(row['index'], 2)}")
Row index and index value of the top 10 points:
Row index = 347, Index value = 0.98
Row index = 134, Index value = 0.97
Row index = 155, Index value = 0.97
Row index = 306, Index value = 0.96
Row index = 369, Index value = 0.96
Row index = 305, Index value = 0.94
Row index = 285, Index value = 0.94
Row index = 330, Index value = 0.93
Row index = 348, Index value = 0.92
Row index = 324, Index value = 0.9
As you would expect looking at the input layers, the most vulnerable and polluted areas are along the river coridor in Loncoln Park, Melvindale, Mexicantown, and River Rouge. There are multiple locations with indexes greater than 0.70. Information like this can be valuable for targeted relief or mitigation programs.
Knowledge Check
What was the main purpose of combining the TRI and SVI data in this analysis?
To calculate total pollution levels for each county
To identify areas with both high air releases and high social vulnerability
To determine the most populous areas in Detroit
To predict future industrial development zones
CDC PLACES
The CDC PLACES (Population Level Analysis and Community Estimates) dataset is a collaboration between the Centers for Disease Control and Prevention (CDC), the Robert Wood Johnson Foundation, and the CDC Foundation. It provides model-based population-level analysis and community estimates of health indicators for all counties, places (incorporated and census designated places), census tracts, and ZIP Code Tabulation Areas (ZCTAs) across the United States. Some key points to consider when working with CDC PLACES:
Spatial Extent: Entire United States, including all 50 states, the District of Columbia, and Puerto Rico.
Spatial Resolution: Multiple levels including counties, cities/towns, census tracts, and ZIP codes.
Indicators: Wide range of chronic disease measures related to health outcomes, prevention, and health risk behaviors.
Data Sources:
Behavioral Risk Factor Surveillance System (BRFSS)
U.S. Census Bureau’s American Community Survey (ACS)
Methodology: Uses small area estimation methods for small geographic areas.
Health risk behaviors: e.g., smoking, physical inactivity, binge drinking
Prevention practices: e.g., health insurance coverage, dental visits, cholesterol screening
Socioeconomic Data: Includes some socioeconomic and demographic variables.
Annual Updates: Providing recent estimates for local areas.
This dataset is valuable for public health researchers, policymakers, and community organizations. It provides a standardized way to compare health indicators across different geographic areas and can be used to inform targeted interventions and policy decisions, especially in addressing health disparities at a local level.
As with ICIS-AIR and TRI, PLACES is available through a web search interface and a programmatic API. We will focus on the API in this example.
Processing
We’ll start by accessing the CDC PLACES data through their API and processing it for our analysis. First we set up our API request. We specify the endpoint URL, define our target counties, and create a filter to select data only for these counties in Michigan.
# Define the GeoJSON API endpointurl ="https://data.cdc.gov/resource/cwsq-ngmh.geojson"# Define the Detroit metro area countiesdetroit_counties = ['Wayne', 'Oakland', 'Macomb']# Create the county filter stringcounty_filter =" OR ".join([f"countyname = '{county}'"for county in detroit_counties])# Define the query parametersparams = {"$where": f"stateabbr = 'MI' AND ({county_filter})","$limit": 50000# Adjust if necessary}
Next, let’s make the API request and process the response:
# Make the API requestresponse = requests.get(url, params=params)if response.status_code ==200: data = response.json()print(f"Successfully retrieved data")else:print(f"Failed to retrieve data. Status code: {response.status_code}")print(response.text)# Convert to GeoDataFramegdf = gpd.read_file(response.text)
Successfully retrieved data
You can inspect the gdf return object on your own; it’s too large to print here. Let’s just take a look at the available health measures in CDC PLACES.
# Print available health measuresprint("\nAvailable health measures:")print(gdf['measure'].unique())# Print basic information about the GeoDataFrameprint("\nGeoDataFrame Info:")print(gdf.info())
Available health measures:
['Frequent physical distress among adults'
'Cancer (non-skin) or melanoma among adults'
'Visited dentist or dental clinic in the past year among adults'
'Cognitive disability among adults' 'Coronary heart disease among adults'
'Current asthma among adults'
'Current lack of health insurance among adults aged 18-64 years'
'Lack of social and emotional support among adults'
'Arthritis among adults' 'Obesity among adults'
'Self-care disability among adults' 'Mobility disability among adults'
'Short sleep duration among adults'
'Frequent mental distress among adults'
'Fair or poor self-rated health status among adults'
'Lack of reliable transportation in the past 12 months among adults'
'All teeth lost among adults aged >=65 years'
'Taking medicine to control high blood pressure among adults with high blood pressure'
'Colorectal cancer screening among adults aged 45–75 years'
'Vision disability among adults'
'Visits to doctor for routine checkup within the past year among adults'
'Diagnosed diabetes among adults'
'Mammography use among women aged 50-74 years'
'Independent living disability among adults'
'Hearing disability among adults' 'Stroke among adults'
'Chronic obstructive pulmonary disease among adults'
'Food insecurity in the past 12 months among adults'
'Feeling socially isolated among adults'
'High cholesterol among adults who have ever been screened'
'High blood pressure among adults'
'No leisure-time physical activity among adults'
'Cholesterol screening among adults'
'Housing insecurity in the past 12 months among adults'
'Utility services shut-off threat in the past 12 months among adults'
'Binge drinking among adults'
'Received food stamps in the past 12 months among adults'
'Current cigarette smoking among adults' 'Any disability among adults'
'Depression among adults']
GeoDataFrame Info:
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 46560 entries, 0 to 46559
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 measure 46560 non-null object
1 low_confidence_limit 46560 non-null object
2 data_value_unit 46560 non-null object
3 data_value 46560 non-null object
4 short_question_text 46560 non-null object
5 statedesc 46560 non-null object
6 totalpop18plus 46560 non-null object
7 locationid 46560 non-null object
8 countyname 46560 non-null object
9 year 46560 non-null object
10 high_confidence_limit 46560 non-null object
11 categoryid 46560 non-null object
12 stateabbr 46560 non-null object
13 data_value_footnote 0 non-null object
14 data_value_type 46560 non-null object
15 data_value_footnote_symbol 0 non-null object
16 locationname 46560 non-null object
17 category 46560 non-null object
18 datavaluetypeid 46560 non-null object
19 measureid 46560 non-null object
20 countyfips 46560 non-null object
21 datasource 46560 non-null object
22 totalpopulation 46560 non-null object
23 geometry 46560 non-null geometry
dtypes: geometry(1), object(23)
memory usage: 8.5+ MB
None
There’s a lot of great data here considering the rarity of health outcomes data; especially at this resolution (census tracts). Because we’re dealing with air quality in this lesson we’ll use “Current asthma among adults.”
Let’s make a quick map to see how it looks.
# Create a sample map for one health measure (e.g., Current asthma)fig, ax = plt.subplots(figsize=(7, 7))# Filter for the specific measure and ensure data_value is numericgdf_asthma = gdf[gdf['measure'] =='Current asthma among adults'].copy()gdf_asthma['data_value'] = pd.to_numeric(gdf_asthma['data_value'], errors='coerce')# Plot the asthma datagdf_asthma.plot(column='data_value', ax=ax, legend=True, legend_kwds={'label': 'Asthma Prevalence (%)', 'orientation': 'horizontal'}, cmap='YlOrRd', missing_kwds={'color': 'lightgrey'})# Add basemapctx.add_basemap(ax, crs=gdf_asthma.crs.to_string(), source=ctx.providers.OpenStreetMap.Mapnik)# Add a border to the mapbbox_polygon_bm.boundary.plot(ax=ax, color='#315c86', linewidth=3)# Set the extent to match the Detroit metro areaax.set_xlim(gdf_asthma.total_bounds[0], gdf_asthma.total_bounds[2])ax.set_ylim(gdf_asthma.total_bounds[1], gdf_asthma.total_bounds[3])plt.title('Asthma Prevalence in Detroit Metro Area', fontsize=16)ax.axis('off')plt.tight_layout()plt.show()
If you hadn’t already noticed while inspecting the object we returned from the API, the census tract data are actually centroid points–not census tract boundaries.
We created a choropleth map showing asthma prevalence across the Detroit metro area. With the YlOrRd colormap darker red indicates higher asthma prevalence.
Finally, let’s print some statistics about the asthma data:
# Print some statistics for the asthma dataprint("\nAsthma Statistics:")print(f"Average asthma prevalence: {gdf_asthma['data_value'].mean():.2f}%")print(f"Minimum asthma prevalence: {gdf_asthma['data_value'].min():.2f}%")print(f"Maximum asthma prevalence: {gdf_asthma['data_value'].max():.2f}%")# Print the number of census tracts per countyprint("\nNumber of census tracts per county:")print(gdf_asthma['countyname'].value_counts())
Asthma Statistics:
Average asthma prevalence: 12.06%
Minimum asthma prevalence: 7.20%
Maximum asthma prevalence: 17.20%
Number of census tracts per county:
countyname
Wayne 583
Oakland 345
Macomb 236
Name: count, dtype: int64
Wow. Some census tracts along the river front and central downtown area have adult asthma prevalance of nearly 20%! I’m not an epidimeologist, but that seems extremely high; especially when you compare to the outer suburbs and more affluent areas in the Royal Oak-Ferndale area. However, in light of our exploration of TRI and ICIS-AIR data it might not be surprising to see elevated asthma rates in these areas.
Given that we’ve mostly been working with and creating raster data in the previous sections, it might be easier to draw systematic comparisons between pollution and health outcomes if the CDC data is also in a raster format.
Interpolate a Surface w/ IDW
While the choropleth map provides a good overview, it can be affected by the varying sizes of census tracts. To get a smoother representation of the data that resembles our air quality data, we can use Inverse Distance Weighting (IDW) interpolation to create a continuous surface.
First, let’s prepare our data for interpolation by reprojecting, extracting the coordinates and the asthma value, and masking the data for any asthma values that may be NA.
from scipy.interpolate import griddatafrom rasterio.transform import from_originfrom rasterio.warp import transform_bounds# Ensure gdf_asthma is in EPSG:4326gdf_asthma = gdf_asthma.to_crs(epsg=4326)# Extract coordinates and valuesX = gdf_asthma.geometry.x.valuesY = gdf_asthma.geometry.y.valuesZ = gdf_asthma['data_value'].values# Remove any NaN valuesmask =~np.isnan(Z)X, Y, Z = X[mask], Y[mask], Z[mask]
Data Science Review
Inverse Distance Weighting (IDW) is a spatial interpolation method used to estimate values at unknown locations based on known values at nearby points. The fundamental principle of IDW is that points closer to the location being estimated have more influence than those farther away. IDW calculates the estimated value as a weighted average of nearby known values, with weights inversely proportional to the distance between the known point and the estimation location. IDW is popular in geographic information systems (GIS) and environmental sciences due to its simplicity and effectiveness in creating continuous surfaces from point data, such as elevation, temperature, or pollution levels. While IDW has advantages like preserving local variation and working well with evenly distributed points, it also has limitations; including potential “bull’s-eye” patterns around data points, inability to account for spatial trends or barriers, and sensitivity to outliers and clustering of input points.
Now we create the structure of the raster we’ll create. This is based on our desired resolution and the extent of our data.
# Create a grid to interpolate over# Resolution (degrees)grid_resolution =0.025# Extentx_min, y_min, x_max, y_max = gdf_asthma.total_bounds# Arrange the locations of the x and y dimensionsgrid_x = np.arange(x_min, x_max, grid_resolution)grid_y = np.arange(y_min, y_max, grid_resolution)# Mesh them to a gridgrid_xx, grid_yy = np.meshgrid(grid_x, grid_y)
This created a regular grid over our study area and then used the griddata function to perform linear interpolation (which is a form of IDW) on our asthma prevalence data.
Now, let’s convert our interpolated data into an xarray Dataset so it’s easier to manipulate.
# Create an xarray Dataset from the interpolated datads = xr.Dataset({'asthma': (['y', 'x'], grid_z),'x': grid_x,'y': grid_y})# Set the correct CRSds.rio.write_crs("EPSG:4326", inplace=True)
<xarray.Dataset>
Dimensions: (y: 33, x: 38)
Coordinates:
* x (x) float64 -83.67 -83.64 -83.62 ... -82.79 -82.77 -82.74
* y (y) float64 42.06 42.09 42.11 42.14 ... 42.79 42.81 42.84 42.86
spatial_ref int32 0
Data variables:
asthma (y, x) float64 nan nan nan nan nan nan ... nan nan nan nan nan
This step allows us to work with our interpolated data using xarray, which provides powerful tools for working with multi-dimensional arrays.
Next, we’ll clip our interpolated surface to the Detroit metro area. This ensures our data matches our other rasters. Then we’ll project it to match basemaps so we can create a map of the data.
# Clip the dataset using the detroit_metro polygonds_clipped = ds.rio.clip(detroit_metro.geometry, drop=False)# Reproject to Web Mercator (EPSG:3857)ds_3857 = ds_clipped.rio.reproject("EPSG:3857")
Finally, let’s create a map of our interpolated asthma prevalence:
# Create the plotfig, ax = plt.subplots(figsize=(7, 7))# Plot the interpolated and clipped dataim = ds_3857.asthma.plot(ax=ax, cmap='YlOrRd', alpha=0.7, add_colorbar=False)# Add colorbar with adjusted heightcbar = plt.colorbar(im, ax=ax, label='Asthma Prevalence', fraction=0.047, pad=0.04, aspect=20)# Add basemapctx.add_basemap(ax, crs=ds_3857.rio.crs, source=ctx.providers.OpenStreetMap.Mapnik)# Add a border to the mapbbox_polygon_bm.boundary.plot(ax=ax, color='#315c86', linewidth=3)# Set the extent to match the Detroit metro areabounds_3857 = transform_bounds("EPSG:4326", "EPSG:3857", x_min, y_min, x_max, y_max)ax.set_xlim(bounds_3857[0], bounds_3857[2])ax.set_ylim(bounds_3857[1], bounds_3857[3])plt.title('IDW Interpolated Asthma Prevalence in Detroit Metro Area', fontsize=16)ax.axis('off')plt.tight_layout()plt.show()
This code creates a map of the interpolated asthma prevalence, providing a smooth surface that shows the spatial variation in asthma rates across the Detroit metro area. The use of IDW interpolation helps to create a continuous surface from our discrete census tract data, potentially revealing patterns that might be less apparent in the choropleth map.
By comparing this interpolated surface with our earlier maps of air pollution and social vulnerability, we can start to explore potential relationships between environmental factors, social conditions, and health outcomes in the Detroit metro area.
We can start with a simple correlation analysis and then visualize it with a scatterplot. The simpleist way to perform the analysis is the extract the raw value pairs and calculate the Pearson correlation.
from scipy import stats# Ensure both rasters have the same shape and are alignedair_release_aligned = air_release_raster_da.rio.reproject_match(ds_clipped)# Flatten the arrays and remove NaN valuesair_release_flat = air_release_aligned.values.flatten()# Take the log of the release values to account for extreme outliersair_release_flat = np.log1p(air_release_flat)asthma_flat = ds_clipped.asthma.values.flatten()mask =~np.isnan(air_release_flat) &~np.isnan(asthma_flat)correlation, p_value = stats.pearsonr(air_release_flat[mask], asthma_flat[mask])print(f"Correlation coefficient: {correlation}")print(f"P-value: {p_value}")
Creating a scatterplot of the paired values can provide a better look at what’s going on.
# Create the empty mapplt.figure(figsize=(7, 7))# Create the scatter plot of the valuesplt.scatter(air_release_flat[mask], asthma_flat[mask], alpha=0.5)# Create Axis Labelsplt.xlabel('TRI Air Releases')plt.ylabel('Asthma Prevalence')# Titleplt.title('TRI Air Releases vs Asthma Prevalence')# Let's see itplt.show()
This paints a clearer picture. Although there is a modest positive correlation, there are numerous cells with 0 TRI Air Release and a wide variety of Asthma prevelance values, which negates any correlation we might be seeing at higher release levels. This is a simple pixel by pixel comparison of matching values. We can perform a more robust analysis accounting for the spatial strucuture using a global autocorrelation analysis and Moran’s I.
Pearson’s correlation and Moran’s I are both measures of association, but they serve different purposes and account for different aspects of data. Pearson’s correlation measures the linear relationship between two variables without considering their spatial context. It simply looks at how two variables change together, regardless of where the data points are located in space. In contrast, Moran’s I is specifically designed for spatial data and measures spatial autocorrelation – the degree to which a variable is correlated with itself across geographic space. Moran’s I takes into account both the values of observations and their spatial relationships, typically using a spatial weights matrix. While Pearson’s correlation can tell you if two variables are related overall, Moran’s I can reveal whether high or low values of a variable tend to cluster together in space, or if they’re randomly distributed. This makes Moran’s I particularly useful for analyzing geographical patterns and spatial dependencies, which are crucial in fields like environmental science, epidemiology, and urban planning.
We can perform this analysis in Python use pysal. We already know our data is spatially aligned from the previous code chunks. We can create a spatial data frame and continue on.
from pysal.explore import esdafrom pysal.lib import weightsimport shapely# Convert rasters to a GeoDataFramedf = gpd.GeoDataFrame( {'air_release': air_release_aligned.values.flatten(),'asthma': ds_clipped.asthma.values.flatten(),'geometry': [ shapely.geometry.Point(x, y) for x, y inzip( np.repeat(air_release_aligned.x.values, len(air_release_aligned.y)), np.tile(air_release_aligned.y.values, len(air_release_aligned.x)) ) ] })
Statistics Review
A p-value, or probability value, is a statistical concept used to determine the significance of results in hypothesis testing. It represents the probability of obtaining results at least as extreme as the observed results, assuming that the null hypothesis is true. In simpler terms, it measures the strength of evidence against the null hypothesis. P-values range from 0 to 1, with lower values indicating stronger evidence against the null hypothesis. Typically, a p-value below a predetermined threshold (often 0.05) is considered statistically significant, suggesting that the observed results are unlikely to have occurred by chance alone. However, it’s important to note that while p-values can indicate the strength of evidence against the null hypothesis, they do not prove or disprove hypotheses, nor do they measure the size or importance of an effect. They should be interpreted in conjunction with other factors such as effect size, sample size, and practical significance.
Check for missing values and create the spatial weights matrix that determines how wide of a net we want to cast as we search for a local dependence structure (you could change these values and see how it impacts the results). You can read more about settings for spatial autocorrelation and Moran’s I analysis here.
# Remove any rows with NaN valuesdf = df.dropna()# Create a spatial weights matrixw = weights.distance.KNN.from_dataframe(df, k=8) # Using 8 nearest neighborsw.transform ='r'# Row-standardize the weights
Now we’ll cycle through the Moran function for air release, current asthma among adults, and the 2 combined (bivariate). First air releases.
# Create a Moran scatter plot for air releasefrom splot.esda import moran_scatterplot# Calculate Moran's I for air releasemoran_air = esda.Moran(df['air_release'], w)print("Moran's I for Air Release:")print(f"I: {moran_air.I}")print(f"p-value: {moran_air.p_sim}")fig, ax = plt.subplots(figsize=(7, 7))moran_scatterplot(moran_air, ax=ax)# Get the current limitsxlim = ax.get_xlim()ylim = ax.get_ylim()# Find the min and max of both axesvmin =min(xlim[0], ylim[0])vmax =max(xlim[1], ylim[1])# Set both axes to the same limitsax.set_xlim(vmin, vmax)ax.set_ylim(vmin, vmax)# Ensure the aspect ratio is 1:1ax.set_aspect('equal', adjustable='box')ax.set_title("Moran Scatter Plot: Air Release")plt.show()
Moran's I for Air Release:
I: 0.11614142149447003
p-value: 0.001
When evaluating Moran’s I for single variables we are assessing how clustered like values are; i.e. do high or low values of air release tend to appear next to each other. On a scale of -1 to 1 a 0.12 suggests a slight positive correlation. High values of air release tend to cluster in similar spots. That said the correlation is lower and we can the same feature as before in that there are a large amount of 0 values that are mixed in all throughout the data. You could attempt to address this issue by using a larger resolution when creating the raster layer depicting air release sums (we used 5000m), or by including different sources of air pollution.
# Calculate Moran's I for asthma prevalencemoran_asthma = esda.Moran(df['asthma'], w)print("\nMoran's I for Asthma Prevalence:")print(f"I: {moran_asthma.I}")print(f"p-value: {moran_asthma.p_sim}")# Create a Moran scatter plot for asthma prevalencefig, ax = plt.subplots(figsize=(7, 7))moran_scatterplot(moran_asthma, ax=ax)# Get the current limitsxlim = ax.get_xlim()ylim = ax.get_ylim()# Find the min and max of both axesvmin =min(xlim[0], ylim[0])vmax =max(xlim[1], ylim[1])# Set both axes to the same limitsax.set_xlim(vmin, vmax)ax.set_ylim(vmin, vmax)# Ensure the aspect ratio is 1:1ax.set_aspect('equal', adjustable='box')ax.set_title("Moran Scatter Plot: Asthma Prevalence")plt.show()
Moran's I for Asthma Prevalence:
I: 0.5179466646931635
p-value: 0.001
Asthma prevelance has a much stronger correlation structure. High asthma levels are clustered together spatially and have a value of 0.52. Also note that the scatterplot shows a nice fit without any odd artifacts or outliers.
# Calculate bivariate Moran's Imoran_bv = esda.Moran_BV(df['air_release'], df['asthma'], w)print("\nBivariate Moran's I (Air Release vs Asthma):")print(f"I: {moran_bv.I}")print(f"p-value: {moran_bv.p_sim}")# Create a bivariate Moran scatter plotfig, ax = plt.subplots(figsize=(7, 7))# make the plotmoran_scatterplot(moran_bv, ax=ax)# Get the current limitsxlim = ax.get_xlim()ylim = ax.get_ylim()# Find the min and max of both axesvmin =min(xlim[0], ylim[0])vmax =max(xlim[1], ylim[1])# Set both axes to the same limitsax.set_xlim(vmin, vmax)ax.set_ylim(vmin, vmax)# Ensure the aspect ratio is 1:1ax.set_aspect('equal', adjustable='box')# Set the titleax.set_title("Bivariate Moran Scatter Plot: Air Release vs Asthma")plt.show()
Bivariate Moran's I (Air Release vs Asthma):
I: 0.10554510298823498
p-value: 0.001
Similar to the pearson analysis, when we analyze their combined spatial dependence the relatipnship is not as strong; largely in part to the high levels of 0 air release values. That said the relationship is positive and it is statistically significant (p-value 0.001).
This could be due to several factors that went unaccounted for in our simple analysis.
Our choice of size/resolution when creating our summed air release raster.
The impacts of TRI air releases could be much more widespread than the 5km resolution we specified.
We only analyze TRI regulated facilities and their emissions.
We did not account for traffic density or other sources of particulate matter emissions.
We did not control for socioeconomic status, prevantative healthcare access, and a whole host of other confounding variables.
There is no underlying relationship. Unlikely based on large amounts of peer reviewed literature, but a possibility none the less.
Knowledge Check
What type of data does the CDC PLACES dataset provide?
Real-time air quality measurements
Estimates of various health indicators at local levels
Hospital admission rates for air pollution-related illnesses
Locations of healthcare facilities in each county
Conclusion
This lesson provided a comprehensive exploration of various environmental and public health datasets, focusing on the Detroit metropolitan area. By analyzing data from ICIS-AIR, the Toxic Release Inventory (TRI), the Social Vulnerability Index (SVI), and CDC PLACES, we gained insights into the complex relationships between industrial air pollution, social vulnerability, and health outcomes. The integration of these datasets, combined with geospatial analysis techniques, allowed us to identify areas where environmental hazards and social vulnerabilities intersect, highlighting potential environmental justice concerns. This approach demonstrates the power of combining diverse datasets and utilizing geospatial tools to inform policy decisions and target interventions in areas of greatest need.
Key Learning Points
Congratulations! In this lesson you:
Accessed and processed data from multiple EPA and CDC APIs, including ICIS-AIR, TRI, and PLACES datasets
Created and manipulated geospatial objects using libraries such as GeoPandas and rasterio
Visualized point data on maps using Matplotlib and Contextily
Performed data cleaning and preprocessing, including handling missing coordinates and outliers
Created choropleth maps to visualize health outcomes data at the census tract level
Utilized Inverse Distance Weighting (IDW) interpolation to create continuous surfaces from point data
Integrated multiple datasets to create a composite index (Air Release Vulnerability Index)
Analyzed spatial patterns of air pollution, social vulnerability, and health outcomes
Applied raster math and resampling techniques to align and compare different spatial datasets
Interpreted results in the context of environmental justice and public health concerns
Perform a Pearson and Moran’s I correlation analysis
Social Vulnerability Index
NASA’s Social Vulnerability Index (SVI) raster dataset, available through the Socioeconomic Data and Applications Center (SEDAC), is a high-resolution geospatial resource that quantifies social vulnerability across the United States. This dataset is based on the CDC’s Social Vulnerability Index but is presented in a raster format, providing continuous coverage at a 250-meter resolution. The SVI incorporates various socioeconomic and demographic factors such as poverty, lack of vehicle access, crowded housing, and minority status to assess communities’ capacity to prepare for, respond to, and recover from hazards, including environmental threats like poor air quality.
The raster format allows for more detailed spatial analysis and integration with other environmental datasets. This makes it particularly valuable for researchers and policymakers studying the intersection of social vulnerability and environmental risks, such as air pollution exposure. By overlaying this SVI data with air quality information, for instance, analysts can identify areas where socially vulnerable populations may be disproportionately affected by poor air quality, supporting environmental justice initiatives and targeted intervention strategies.
SEDAC, as part of NASA’s Earth Observing System Data and Information System (EOSDIS), hosts this dataset along with other socioeconomic and environmental data, facilitating interdisciplinary research on human-environment interactions. The SVI raster dataset’s high resolution and comprehensive coverage make it a powerful tool for assessing environmental equity and informing policy decisions at various geographic scales.
Although the SVI dataset is free to acquire, it does require an Earth Data account. If you don’t already have an Earth Data account you can follow these steps to download the SVI dataset on your local computer:
Remember, your Earth Data account gives you access not just to SEDAC, but to a wide range of NASA Earth science data. It’s a valuable resource for researchers, students, and anyone interested in environmental and socioeconomic data.
Data Processing
Once you’ve downloaded the dataset to your working directory, you can proceed with the analysis. We’ll be using the 2020 census tract version of SVI.
The different layers of SVI are provided as individual files, but sometimes it’s easier to work with a multilayer object. We can create one using xarray. To begin, we’ll read in each file individually, clip it to our border of Detroit metro, and create an individual data array.
Now we can combine them together and give the layers “pretty” names.
<xarray.DataArray (layer: 5, y: 105, x: 119)> array([[[-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38], [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38], [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38], ..., [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38], [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38], [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38]], [[-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38], [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38], [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38], ... [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38], [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38], [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38]], [[-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38], [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38], [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38], ..., [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38], [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38], [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38]]], dtype=float32) Coordinates: * y (y) float64 42.9 42.89 42.88 42.87 ... 42.05 42.05 42.04 42.03 * x (x) float64 -83.69 -83.68 -83.67 -83.66 ... -82.72 -82.71 -82.7 * layer (layer) <U13 'Overall' 'Minority' ... 'Housing' 'Household' Attributes: crs: EPSG:4326 transform: | 0.01, 0.00,-83.69|\n| 0.00,-0.01, 42.90|\n| 0.00, 0.00, 1.00|Now we can plot each layer.
This provides an excellent look at demographic trends in the Detroit metro area. Overall vulnerability is highest in the inner city. The most striking drivers are the concentration of minorities and socioeconomic vulnerability in the downtown area. The housing and household components are slightly more varied throughout the region.
What does the Social Vulnerability Index (SVI) primarily measure?